Regression Analysis with Python Reading 


Regression analysis is a set of statistical methods used to estimate relationships between a 
dependent variable and one or more independent variables. 


For example, suppose you’re a sales manager trying to predict next month’s numbers. You 
know that dozens, perhaps even hundreds, of factors, from the weather to a competitor’s 
promotion, can affect those numbers. To model these relationships, you would employ some 
of the following regression techniques. 


Simple Linear Regression 

• Simple linear regression lets us study the relationship between two variables. One 
variable is the independent variable (x), and the other is the dependent variable (y). 

• This type of regression is characterized by the formula Y = β₀ + β₁X + ε, which is 
graphed as a straight line, where: 

o X is the value of the independent variable. 

o Y is the value of the dependent variable. 

o β₀ is a constant (the intercept). 

o β₁ is the regression coefficient (the slope). 

▪ This tells us how much Y changes for each unit change in X. 

o ε is the error term. This is the difference between the actual y and the 
predicted y. The predicted y is denoted as ŷ. This error is evaluated for each 
observation. These errors are also called residuals. 

• A regression line can show a positive linear relationship, a negative linear relationship, or 
no relationship. 

• Simple linear regression is appropriate for modeling linear trends where the data is 
uniformly spread around the line; other modeling techniques need to be considered in 
other cases. 

• Example: 

o Relationship between sales and advertising costs. 

• Limitations 

o Simple linear regression shows association, not causation. To understand whether one 
variable causes another, you will need additional research and statistical analysis. 

o This type of regression is sensitive to outliers, which can lead to inaccurate 
predictions by the model. 
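
The straight-line formula described in this section can be illustrated with a minimal sketch that estimates the constant (intercept) and the regression coefficient (slope) by ordinary least squares in plain Python. The advertising and sales figures are made-up values for illustration only, and the function name `fit_simple_linear_regression` is hypothetical, not part of any library.

```python
def fit_simple_linear_regression(x, y):
    """Return (intercept, slope) of the least-squares line y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Slope = sum of cross-deviations divided by sum of squared deviations of x.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    # The fitted line always passes through the point (mean_x, mean_y).
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Made-up data (assumption): advertising costs (x) and sales (y).
advertising = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [3.1, 4.9, 7.2, 8.8, 11.0]

intercept, slope = fit_simple_linear_regression(advertising, sales)

# Predicted y for each observation, and the residual (actual minus predicted).
predicted = [intercept + slope * xi for xi in advertising]
residuals = [yi - pi for yi, pi in zip(sales, predicted)]

print(f"intercept = {intercept:.3f}, slope = {slope:.3f}")
print("residuals:", [round(r, 3) for r in residuals])
```

A positive slope here corresponds to a positive linear relationship; the residuals show how far each observation sits from the fitted line.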


Multiple Linear Regression 
• This type of regression is used to predict the value of a variable based on the value of 
two or more other variables. 

• We can also use this type of regression to identify the strength of the independent 
variables’ effect on a dependent variable. 


• The multiple linear regression equation is as follows: 
Y = β₀ + β₁X₁ + β₂X₂ + ... + βₚXₚ + ε, where: 

o Y is the dependent variable. 

o X₁ through Xₚ are the distinct independent variables. 

o β₀ is the value of Y when all independent variables (X₁ through Xₚ) are equal to 
zero, and β₁ through βₚ are the estimated regression coefficients. 

o ε is the error term. This is the difference between the actual y and the 
predicted y. The predicted y is denoted as Ŷ. This error is evaluated for each 
observation. These errors are also called residuals. 

• Assumptions (These are the same as for simple linear regression) 

o Linear relationship: A linear relationship is assumed between the dependent 
variable and the independent variables. 

o Normality: The values of the residuals are normally distributed. 

o No multicollinearity: The independent variables are not highly correlated with one another. 

o Homoscedasticity: The size of the error in our prediction doesn’t change 
significantly across the values of the independent variables. 

o No outliers: No influential cases are biasing your model. 

o Independence: The values of the residuals are independent, such that one observation of the 
error term should not allow us to predict the next observation. 

• Limitations 

o This type of regression is prone to multicollinearity. This is the case where 
independent variables are highly correlated with one another. It is a problem 
because this type of regression assumes there is no strong relationship among the 
independent variables. 

o It is prone to noise and overfitting. If the number of observations is smaller than 
the number of features, the model may overfit because it starts fitting 
noise while building the model. 

• Improving your model 

o Need more data: A larger dataset generally leads to better predictions. 

o Wrong assumptions: We assumed the data has a linear relationship, but that 
might not be the case. Visualizing the data may help you determine that. 

o Poor features: The features we used may not have a high enough correlation 
with the values we are trying to predict. 

• Applications 

o Sales forecasting 

o Satisfaction analysis 

o Price estimation 

o Employment income 
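
As a rough illustration of the multiple linear regression equation in this section, the sketch below estimates the constant and the regression coefficients by least squares. It assumes NumPy is available (the source does not name a library), and the data are made up: the dependent variable is generated exactly as 1 + 2·X1 + 3·X2 with no noise, so the fit should recover those values and leave near-zero residuals.

```python
import numpy as np

# Made-up data (assumption): two independent variables per observation.
X = np.array([
    [1.0, 2.0],
    [2.0, 1.0],
    [3.0, 4.0],
    [4.0, 3.0],
    [5.0, 6.0],
    [6.0, 5.0],
])
# Dependent variable built without noise so the true coefficients are known.
y = 1.0 + 2.0 * X[:, 0] + 3.0 * X[:, 1]

# Prepend a column of ones so the first coefficient acts as the constant term.
design = np.column_stack([np.ones(len(X)), X])

# Least-squares estimate of [constant, coefficient of X1, coefficient of X2].
coefficients, _, _, _ = np.linalg.lstsq(design, y, rcond=None)

# Predicted values and residuals (actual minus predicted) for each observation.
predicted = design @ coefficients
residuals = y - predicted

print("coefficients:", np.round(coefficients, 3))
print("largest absolute residual:", float(np.max(np.abs(residuals))))
```

With real, noisy data the residuals would not vanish, and inspecting them is one way to check the assumptions listed in this section.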


K-Nearest Neighbors (KNN) 
K-Nearest Neighbors (KNN) is an algorithm that can be used for both regression and 
classification. Our session will focus on using KNN for regression. 


KNN is a non-parametric algorithm (which means that it does not make any assumptions 
about the underlying data distribution). 

It is also an instance-based algorithm that compares new problem instances with 
instances seen in training, which have been stored in memory. As such, it looks at the 
nearest neighbors to decide what any new point should be. 

It works by storing all available cases and using a similarity measure (distance function) to 
find the K nearest neighbors of a new point, then predicting the average of their numerical 
targets. Common distance functions are Euclidean, Manhattan, and Minkowski. 
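
The procedure just described can be sketched in a few lines of plain Python. This is a minimal illustration, not a production implementation: the function names `minkowski_distance` and `knn_predict` and the training data are hypothetical. The Minkowski distance with p = 1 is the Manhattan distance and with p = 2 is the Euclidean distance, so one function covers all three.

```python
def minkowski_distance(a, b, p=2):
    """Minkowski distance; p=1 gives Manhattan, p=2 gives Euclidean."""
    return sum(abs(ai - bi) ** p for ai, bi in zip(a, b)) ** (1 / p)


def knn_predict(train_X, train_y, query, k=3, p=2):
    """Predict a numeric target as the mean target of the k nearest stored cases."""
    # Pair every stored case's distance to the query with its target, nearest first.
    distances = sorted(
        (minkowski_distance(row, query, p), target)
        for row, target in zip(train_X, train_y)
    )
    nearest_targets = [target for _, target in distances[:k]]
    return sum(nearest_targets) / len(nearest_targets)


# Made-up training data (assumption): one feature, one numeric target.
train_X = [[1.0], [2.0], [3.0], [6.0], [7.0], [8.0]]
train_y = [10.0, 12.0, 14.0, 30.0, 32.0, 34.0]

print(knn_predict(train_X, train_y, [2.5], k=2))        # prints 13.0
print(knn_predict(train_X, train_y, [7.0], k=3, p=1))   # prints 32.0
```

Note that nothing is fitted in advance: all of the work happens when a prediction is requested, which is what "instance-based" means here.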

While performing regression, we have to choose the parameter K. How do we 
choose the correct value of K? 

We simply run the KNN algorithm several times with different values of K and choose the 
K that reduces the number of errors we encounter while maintaining the algorithm’s 
ability to accurately make predictions when it’s given data it hasn’t seen before. 

o As we decrease the value of K to 1, our predictions become less stable. 

o As we increase the value of K, our predictions become more stable due to 
averaging over more neighbors, and thus more likely to be accurate (up 
to a certain point). Eventually, we begin to witness an increasing number of 
errors. 
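
One way to run KNN with several values of K and compare them is to hold out some data and measure the error on it. The sketch below assumes a simple held-out validation set and mean squared error as the error measure; the source does not specify either, and the data and function names are made up for illustration.

```python
def knn_predict(train_X, train_y, query, k):
    """Mean target of the k training points closest to query (single feature)."""
    ranked = sorted(zip(train_X, train_y), key=lambda pair: abs(pair[0] - query))
    nearest = ranked[:k]
    return sum(target for _, target in nearest) / len(nearest)


def validation_mse(train_X, train_y, valid_X, valid_y, k):
    """Mean squared error of KNN predictions on data not used for training."""
    errors = [
        (knn_predict(train_X, train_y, x, k) - y) ** 2
        for x, y in zip(valid_X, valid_y)
    ]
    return sum(errors) / len(errors)


# Made-up data (assumption): roughly y = 2x with a little noise.
train_X = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
train_y = [2.3, 3.6, 6.4, 7.7, 10.2, 11.9, 14.3, 15.8, 18.1, 19.7]
valid_X = [2.5, 5.5, 8.5]
valid_y = [5.0, 11.0, 17.0]

# Try several values of K and keep the one with the lowest validation error.
scores = {
    k: validation_mse(train_X, train_y, valid_X, valid_y, k)
    for k in (1, 2, 4, 10)
}
best_k = min(scores, key=scores.get)

for k, score in scores.items():
    print(f"K={k:2d}  validation MSE={score:.4f}")
print("best K:", best_k)
```

On this toy data, K = 1 follows the noise of single points, while K = 10 averages over every training point and ignores where the query lies, so the error is lowest at an intermediate K.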

It's commonly used for its ease of interpretation and low calculation time on small datasets. 
Benefits 
o It’s easy to implement as there are only two parameters to choose: the value of K and 
the distance function (e.g., Euclidean or Manhattan). 
o No training period: KNN simply stores the training dataset and does its work at 
prediction time. 
Limitations 


o If the data is high-dimensional, KNN becomes time-consuming, so applying KNN to 
such data is not advisable. High-dimensional here means that the number of 
variables/features is large compared to the number of observations. 

o Feature scaling needs to be done before applying the algorithm; otherwise, features 
with larger ranges dominate the distance calculation and wrong predictions could result. 

o Sensitive to noisy data, missing values, and outliers. 
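
To see why feature scaling matters for a distance-based method, the sketch below compares the nearest neighbor of one point before and after standardizing each feature to mean 0 and standard deviation 1. Standardization is only one possible scaling method and is an assumption here; the income and age values and the function names are made up for illustration.

```python
import math


def standardize(columns):
    """Rescale each feature column to mean 0 and standard deviation 1.

    Assumes every column has a non-zero spread.
    """
    scaled = []
    for column in columns:
        mean = sum(column) / len(column)
        std = math.sqrt(sum((v - mean) ** 2 for v in column) / len(column))
        scaled.append([(v - mean) / std for v in column])
    return scaled


def euclidean(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def nearest_index(points, query_index):
    """Index of the point closest to points[query_index], excluding itself."""
    query = points[query_index]
    others = [i for i in range(len(points)) if i != query_index]
    return min(others, key=lambda i: euclidean(points[i], query))


# Made-up data (assumption): income in dollars and age in years.
income = [30000.0, 30500.0, 34000.0, 90000.0, 60000.0]
age = [25.0, 65.0, 26.0, 45.0, 35.0]

raw_points = list(zip(income, age))
scaled_income, scaled_age = standardize([income, age])
scaled_points = list(zip(scaled_income, scaled_age))

# Without scaling, income differences swamp the 40-year age gap.
print("nearest to point 0 (raw):   ", nearest_index(raw_points, 0))
# After scaling, both features contribute on a comparable footing.
print("nearest to point 0 (scaled):", nearest_index(scaled_points, 0))
```

On the raw values, the neighbor chosen for the first person is someone 40 years older with a similar income; after scaling, it is the person of nearly the same age with a slightly higher income.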


Decision Tree Regression 


This is a type of supervised learning algorithm that can be used for classification and 
regression problems. Our focus will be its application to regression problems. 

This type of regression uses decision trees to split the data multiple times according to 
specific cutoff values in the features. Through splitting, different subsets of the dataset 
are created, with each instance belonging to one subset. The final subsets are called 
terminal or leaf nodes, and the intermediate subsets are internal nodes or split nodes. To 
predict the outcome in each leaf node, the average outcome of the training data in this 
node is used. 
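
The splitting idea can be sketched with the smallest possible tree: a single split on one feature, producing two leaf nodes. The sketch assumes the cutoff is chosen to minimize the total squared error within the two subsets, a criterion commonly associated with CART; the source does not state which criterion is used. Function names and data are hypothetical.

```python
def mean(values):
    return sum(values) / len(values)


def sum_squared_error(values):
    """Squared error of a subset when it is predicted by its own mean."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


def best_split(x, y):
    """Return (error, cutoff, left_mean, right_mean) for the best single split."""
    best = None
    ordered = sorted(set(x))
    # Candidate cutoffs: midpoints between consecutive distinct feature values.
    for low, high in zip(ordered, ordered[1:]):
        cutoff = (low + high) / 2
        left = [yi for xi, yi in zip(x, y) if xi <= cutoff]
        right = [yi for xi, yi in zip(x, y) if xi > cutoff]
        error = sum_squared_error(left) + sum_squared_error(right)
        if best is None or error < best[0]:
            best = (error, cutoff, mean(left), mean(right))
    return best


def predict(split, value):
    """Predict with the average outcome of the leaf the value falls into."""
    _, cutoff, left_mean, right_mean = split
    return left_mean if value <= cutoff else right_mean


# Made-up data (assumption): two clearly separated groups.
x = [1, 2, 3, 4, 10, 11, 12, 13]
y = [5.0, 6.0, 5.5, 6.5, 20.0, 21.0, 19.5, 20.5]

split = best_split(x, y)
print("cutoff:", split[1])                         # prints cutoff: 7.0
print("prediction at 2.5:", predict(split, 2.5))   # prints prediction at 2.5: 5.75
print("prediction at 11:", predict(split, 11))     # prints prediction at 11: 20.25
```

A full tree repeats this search inside each subset until a stopping rule is met, which is also where the risk of overfitting comes from.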

We can use several algorithms to build decision trees, e.g., CART, ID3, and C4.5. 

We use decision trees where there are non-linear or complex relationships between 
features and labels. 


Benefits 
o It requires little data preparation. 
o It performs well with large datasets. 
o It requires relatively little effort to train. 
Limitations 
o There is a high probability of overfitting in the decision tree. 
o It often gives lower prediction accuracy than other machine 
learning algorithms. 
o A small change in the training data can significantly change the tree and, 
consequently, the final predictions. 


Support Vector Machine - Regression 


This is a regression algorithm that can be used to work with continuous values. 

We can use it to avoid the difficulties of using linear functions in a high-dimensional 
feature space. 

The SVR technique relies on kernel functions to construct the model. The commonly used 
kernel functions are: 


o Linear 

o Polynomial 

o RBF (radial basis function) 

o Sigmoid 
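
For reference, the four kernels listed above can be written out in plain Python using their usual textbook forms. The parameter names `gamma`, `coef0`, and `degree` follow a common convention and are assumptions here, as are the default values; the source does not define them.

```python
import math


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def linear_kernel(a, b):
    """Plain dot product of the two feature vectors."""
    return dot(a, b)


def polynomial_kernel(a, b, degree=3, gamma=1.0, coef0=1.0):
    """(gamma * <a, b> + coef0) raised to the given degree."""
    return (gamma * dot(a, b) + coef0) ** degree


def rbf_kernel(a, b, gamma=1.0):
    """exp(-gamma * squared Euclidean distance); equals 1 when a == b."""
    squared_distance = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-gamma * squared_distance)


def sigmoid_kernel(a, b, gamma=1.0, coef0=0.0):
    """tanh(gamma * <a, b> + coef0)."""
    return math.tanh(gamma * dot(a, b) + coef0)


# Made-up feature vectors (assumption) to compare the kernels.
u = [1.0, 2.0]
v = [3.0, 4.0]

print("linear:    ", linear_kernel(u, v))
print("polynomial:", polynomial_kernel(u, v, degree=2))
print("rbf:       ", rbf_kernel(u, v, gamma=0.5))
print("sigmoid:   ", sigmoid_kernel(u, v, gamma=0.1))
```

Each kernel measures similarity between two observations in a different way, which is why the choice of kernel changes the shape of the function the model can fit.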


While implementing the SVR technique, the user needs to select the appropriate kernel 
function. The selection of a kernel function is tricky and requires optimization techniques 
for the best selection. 

To fit the model, we define how much error is acceptable and 
find an appropriate line (or hyperplane in higher dimensions) to fit the data. 
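
The idea of defining how much error is acceptable can be illustrated with the epsilon-insensitive loss usually associated with SVR: errors inside the accepted margin cost nothing, and larger errors are charged only for the part that exceeds the margin. The name `epsilon` for the margin is the conventional one and is an assumption here, as are the made-up values.

```python
def epsilon_insensitive_loss(actual, predicted, epsilon=0.5):
    """Per-observation loss: zero inside the +/- epsilon margin, the excess outside it."""
    return [max(0.0, abs(a - p) - epsilon) for a, p in zip(actual, predicted)]


# Made-up values (assumption) for illustration.
actual = [10.0, 12.0, 15.0, 20.0]
predicted = [10.25, 11.0, 15.5, 17.0]

losses = epsilon_insensitive_loss(actual, predicted, epsilon=0.5)
print(losses)  # prints [0.0, 0.5, 0.0, 2.5]
```

Because small errors are ignored entirely, only the observations outside the margin influence the fitted line or hyperplane.
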
Advantages 
o SVR is robust to outliers. 
o SVR can require less computation than other regression techniques. 
o Its implementation is straightforward. 
Disadvantages 
o Choosing an appropriate kernel function is difficult and complex. 
o The complexity and memory requirements of SVM are very high. 
o One must do feature scaling of variables before applying SVM. 
o SVM takes a long training time on large datasets. 